Skip to content

Fix ports.isEmpty() reliability gap: make request correlation survive scheduler hops#313

Open
Mishenevd wants to merge 3 commits into
fix/webclient-outbound-trackingfrom
fix/webclient-scheduler-hop
Open

Fix ports.isEmpty() reliability gap: make request correlation survive scheduler hops#313
Mishenevd wants to merge 3 commits into
fix/webclient-outbound-trackingfrom
fix/webclient-scheduler-hop

Conversation

@Mishenevd

@Mishenevd Mishenevd commented Jul 2, 2026

Copy link
Copy Markdown
Collaborator

Context

Follow-up to #312. Hans's review there flagged ports.isEmpty() (used to distinguish real requests from infra noise in DNSRecordCollector) as not fully reliable. #312 fixed the one concrete case found at the time (WebClient redirects) and left the general problem as a documented, unverified hypothesis.

This PR closes that gap for real: Context/PendingHostnamesStore were ThreadLocal, which only survives when a WebClient call's "intent" (registering the pending host+port) and its "effect" (the actual socket connect) run on the same OS thread. That holds for the default WebFlux request-handling flow, but breaks under:

  • An app's own explicit scheduler switch (.publishOn(Schedulers.boundedElastic()) before a WebClient call - a common pattern for mixing blocking JDBC with reactive controllers).
  • Reactor Netty's own internal event-loop dispatch (subscribing thread → reactor-http-nio thread) - separate from any app-level scheduler hop.

Confirmed via e2e before investing in a fix, not just reasoned about on paper. Added a test endpoint (/api/request/publish-on, Mono.just(url).publishOn(Schedulers.boundedElastic()).flatMap(...)) targeting the same private IP (169.254.169.254) used elsewhere in #312's e2e tests:

  • Without the hop: blocked, HTTP 500, SSRF Attack detected due to: 169.254.169.254:80.
  • With the hop: not blocked - URLCollector registered intent on thread boundedElastic-1, DNSRecordCollector.reportConnect() ran the check on thread reactor-http-nio-5, no SSRF log line at all, the request actually went out over the network (curl hung until timeout).

What changed

  • PendingHostnamesStore.java - rewritten from ThreadLocal to a global, synchronized, bounded-LRU map (same 1000-entry cap and eviction policy as before). Each entry now also carries the ContextObject that was active when it was registered, not just the ports.

    Trade-off: two concurrent requests to the same hostname can now share/overwrite each other's entry in a narrow race window. Doesn't open an SSRF bypass - SSRFDetector/StoredSSRFDetector still run unconditionally either way - worst case is a wrong source attribution for that one request. This is a real, demonstrated effect and not just theoretical: it broke RestTemplateTest/InetAddressTest during development (turned out to be an unrelated reflection bug this change exposed, see below, now fixed) - worth keeping an eye on if dashboard source attribution ever looks off for high-traffic shared hostnames.

    Considered an object-identity-keyed alternative (key by the Reactor Netty request-handler object instead of hostname, avoiding the cross-contamination trade-off entirely) but ruled it out: that object isn't reachable from SpringWebClientWrapper's hook point (ExchangeFunction.exchange(), Spring's own layer, which runs before Reactor Netty constructs its own internal handler for the call).

  • WebRequestCollector.java - removed the per-incoming-request PendingHostnamesStore.clear() call. It was a ThreadLocal-era safety net ("wipe this thread's leftover entries before handling a new request") that becomes actively harmful now that the store is global - it would wipe other concurrent requests' in-flight entries, not just stale ones from a previous request on a reused thread. The bounded-LRU eviction already handles unbounded growth on its own.

  • ReactorAikidoContext.java (new) - bridges the captured ContextObject through Reactor's own Context (reactor.util.context.Context/ContextView - not our own Context class) so it survives .publishOn() between the incoming request and a WebClient call made while handling it. SpringWebfluxWrapper writes it via .contextWrite(); SpringWebClientWrapper reads it back via .deferContextual() at ExchangeFunction.exchange() time and passes it into PendingHostnamesStore alongside the port. DNSRecordCollector.report()/reportConnect() temporarily restore that captured context around the SSRF check, so it's used regardless of which thread actually ends up processing the connect.

    Everything in this class goes through reflection rather than a static type reference to reactor.util.context.*. reactor-core is compileOnly for this module: unlike @Advice-bound parameter types (which ByteBuddy resolves specially against the woven target's own classloader), a normal helper class that declares Context/ContextView as a concrete parameter type fails to verify at all on the agent's own classloader, which has no visibility into the target app's classpath - confirmed by hitting exactly that NoClassDefFoundError on the first attempt. A second attempt using lambdas for the Function callbacks hit a different problem: ByteBuddy inlines @Advice bytecode into the target class, so a lambda constructed there becomes a private synthetic method that the target class can't call back into (IllegalAccessError) - same class of bug already hit once in Track and protect WebClient outbound requests, fix private-IP SSRF regression #312 with SpringWebClientRedirectWrapper's externalForm helper. Fixed with a plain, public, Object-typed InvocationHandler class instead of a lambda.

  • Considered io.micrometer:context-propagation first (Micrometer's ThreadLocalAccessor API, purpose-built for this). Dropped it: it only helps if the target app already happens to have that optional library on its classpath, which can't be required or forced - the agent can't add a hard dependency that isn't guaranteed present. The fix above needs nothing beyond reactor-core, which is guaranteed present wherever WebClient is.

  • Incidental bugfix found while testing this: HttpURLConnectionWrapper and HttpClientSendWrapper locate URLCollector.report via reflection, matching by method name only (clazz.getMethods() + .getName().equals("report")). Adding the new report(URL, ContextObject) overload made this ambiguous - depending on JVM method-ordering, it could silently pick the wrong overload, throw an IllegalArgumentException on the argument-count mismatch, and (since both wrappers run with suppress = Throwable.class) fail completely silently, breaking outbound tracking/SSRF for plain HttpURLConnection/java.net.http.HttpClient users. Caught by RestTemplateTest/InetAddressTest failing during development. Fixed by looking up the exact single-argument overload (clazz.getMethod("report", URL.class)) instead of matching by name.

e2e scenario tested

Same setup as #312 (real agent, real Spring WebFlux sample app, mock core):

SSRF via WebClient after an explicit scheduler hop, now blocked:

curl -G "http://localhost:8090/api/request/publish-on" --data-urlencode "url=http://169.254.169.254/latest/meta-data/"

→ HTTP 500, SSRF Attack detected due to: 169.254.169.254:80, source=query. Log confirms the cross-thread path: URLCollector registers intent on thread boundedElastic-1, DNSRecordCollector.reportConnect() runs the check on thread reactor-http-nio-5 - different threads, correctly correlated.

Also re-ran all of #312's existing e2e scenarios (safe request, literal-IP SSRF, dual-stack, redirect) to confirm no regressions - all still pass.

Tests

New PendingHostnamesStoreTest cases:

  • testEntryIsVisibleFromADifferentThread - the actual point of making the store global: writes on one thread, reads on another, confirms the entry is visible.
  • testGetContextReturnsWhatWasCapturedAtAddTime / testGetContextIsNullWhenNoneWasCaptured - the captured-ContextObject half of an entry.
  • Renamed testUnboundedHostnamesDoNotGrowThreadLocalMapForevertestUnboundedHostnamesDoNotGrowMapForever (no longer thread-local).

Full regression pass: 1036 tests, same 42 pre-existing environmental failures (network/DB dependent, unavailable in sandbox) as #312's baseline, zero new failures.

New OkHttpTest.testSSRFAsyncEnqueueOnDifferentThread - a separate customer report surfaced the same class of bug via OkHttpClient: newCall() registers intent on the calling thread, but call.enqueue() (async) runs the actual connect on OkHttp's own Dispatcher thread pool, a different thread. This PR's PendingHostnamesStore fix is generic, not Spring/Reactor-specific, so it turns out to already cover this case with no OkHttp-specific changes needed - confirmed by running the same test against the base branch first (fails: real network call goes out unblocked) and then this branch (passes).

Addendum: tried to close a #312 review comment fully, found a real limit

Hans's review on #312 also asked for something matching HttpURLConnectionTest's pattern - a single JUnit test where a real client call to a private IP gets blocked, no splitting needed. #312 couldn't do that for WebClient (had to split WebClientTest into two tests) because of the ThreadLocal thread-hop problem this PR fixes - so tried writing that single test here, assuming it'd now work.

It doesn't, and not for a reason this PR can fix: ReactorAikidoContext's taint-context capture only gets written into Reactor's own Context by SpringWebfluxWrapper, which only fires for a real incoming WebFlux request. A bare webClient.get().uri(privateIP)...block() call made directly from a JUnit test, with no request behind it, never populates it - confirmed empirically, the request actually went out over the network and hung until timeout instead of getting blocked. Unlike HttpURLConnectionTest (whose SSRF check never depended on any request-scoped context propagation beyond a plain ThreadLocal on the same thread), this fix inherently needs a real Spring request context to exist.

The correct test for "WebClient in a Spring app gets blocked" is what already exists: spring_webflux_postgres.py's ssrf payload (from #312) plus this PR's own scheduler-hop e2e scenario, both against the real running sample app. Documented this in WebClientTest.java's javadoc so it doesn't get re-litigated in review as "why isn't this just one test".

Known remaining limitations (unchanged from #312, not addressed here)

  • CRITICAL, found while investigating this PR's own CI failures: SocketChannelWrapper never fires under Netty's native epoll transport (Linux's default) - see Track and protect WebClient outbound requests, fix private-IP SSRF regression #312's worklog item 1 for the full writeup. EpollSocketChannel implements Netty's own io.netty.channel.socket.SocketChannel, not the JDK's java.nio.channels.SocketChannel our matcher targets, despite the identical simple name. WebClient traffic is completely untracked/unprotected under this transport, which is why this PR's own SpringWebfluxSampleApp e2e job fails 100% of the time in CI while passing locally on macOS (native epoll lib present but fails to load there, silent fallback to plain NIO masks the gap). Not yet fixed - candidate direction noted in Track and protect WebClient outbound requests, fix private-IP SSRF regression #312.
  • Spring's RestClient (6.1+) instrumentation status is still unverified.
  • IsPrivateIP.isPrivateIp()'s handling of exotic private-IP encodings (decimal/octal, IPv4-mapped IPv6) is still unverified.
  • Spring WebFlux still has no request-body taint tracking.

🤖 Generated with Claude Code

Summary by Aikido

Security Issues: 0 🔍 Quality Issues: 3 Resolved Issues: 0

🚀 New Features

  • Added ReactorAikidoContext to carry ContextObject through Reactor Context

🐛 Bugfixes

  • Replaced ThreadLocal store with global synchronized LRU storing context
  • Fixed reflection lookup to invoke exact URLCollector.report(URL) overload

🔧 Refactors

  • Updated WebFlux/WebClient wrappers to write/read Context via Reactor Context

More info

… scheduler hops

PendingHostnamesStore was ThreadLocal, so a WebClient call's registered
intent could get lost whenever intent and connect ran on different OS
threads - either an app's own .publishOn() or Reactor Netty's internal
event-loop dispatch. Confirmed live: an SSRF request to a private IP
went out completely unblocked when preceded by a scheduler hop.

Made PendingHostnamesStore a global bounded-LRU map instead (so it
survives any thread hop), each entry now also carrying the
ContextObject captured at registration time, bridged through Reactor's
own Context (not ours) via a new reflection-only helper so it works
without needing reactor-core as a hard runtime dependency.

Also fixes a latent bug this surfaced: two wrappers located
URLCollector.report via name-only reflection, which silently broke
once a second overload existed.

private ReactorAikidoContext() {}

// `mono` is a Mono<Void>, returned Object is that same Mono<Void> wrapped with .contextWrite().

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ReactorAikidoContext uses extensive reflection, dynamic Proxy, and classloader-based lookups (write, deferRegisterUrl, RegisterUrlHandler.invoke), which obscures control flow and behavior and hinders auditing.

Details

✨ AI Reasoning
​The new helper uses runtime classloader lookups (Class.forName), reflection to call methods (getMethod/invoke), and a dynamic Proxy/InvocationHandler that inspects and invokes on opaque Objects passed from the target app's reactor classes. These constructs make the control flow and data flow non-obvious to code reviewers and static analysis tools and can be used to hide or obfuscate behavior. The code attempts to bridge reactor types without a compile-time dependency by keeping everything Object-typed and reflective, which increases opacity. Specifically, the write() and deferRegisterUrl() methods and the RegisterUrlHandler.invoke() perform many reflective operations and dynamically call into target-context objects, which can be used to execute logic that is hard to audit.

🔧 How do I fix it?
Ensure code is transparent and not intentionally obfuscated. Avoid hiding functionality from code review. Focus on intent and deception, not specific patterns.

Reply @AikidoSec feedback: [FEEDBACK] to get better review comments in the future.
Reply @AikidoSec ignore: [REASON] to ignore this issue.
More info

try {
Method getOrDefault = ctxView.getClass().getMethod("getOrDefault", Object.class, Object.class);
context = (ContextObject) getOrDefault.invoke(ctxView, KEY, null);
} catch (Throwable ignored) {

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Empty catch in RegisterUrlHandler.invoke swallowing Throwable; add logging or explicit handling so context extraction/reporting failures aren't silently ignored.

Details

✨ AI Reasoning
​A new private InvocationHandler implementation swallows all Throwables with an empty catch block. This handler runs during Reactor context extraction and URL registration, so silent failures could hide problems in request correlation. The empty catch neither logs nor documents why errors are ignored, making debugging harder and potentially masking bugs in context retrieval or URL reporting.

🔧 How do I fix it?
Add proper error handling in catch blocks. Log the error, show user feedback, or rethrow if needed.

Reply @AikidoSec feedback: [FEEDBACK] to get better review comments in the future.
Reply @AikidoSec ignore: [REASON] to ignore this issue.
More info

* application's classpath. Must be public: the woven target class (in a completely different
* package) needs to call into it directly.
*/
public final class ReactorAikidoContext {

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ReactorAikidoContext mixes Reactor reflection/plumbing with URLCollector.report call; split reactor-context bridging from URL registration logic.

Details

✨ AI Reasoning
​The new ReactorAikidoContext both performs low-level Reactor reflection plumbing (constructing Context objects, creating Function proxies, invoking contextWrite/deferContextual) and directly calls URLCollector.report to register URLs. This couples reactor-integration plumbing with application-level collector behavior, increasing responsibility scope and coupling.

🔧 How do I fix it?
Split classes that handle database, HTTP, and UI concerns into separate, focused classes.

Reply @AikidoSec feedback: [FEEDBACK] to get better review comments in the future.
Reply @AikidoSec ignore: [REASON] to ignore this issue.
More info

dmitrii added 2 commits July 2, 2026 15:18
… HttpURLConnectionTest

Tried writing one (webClient.get().uri(privateIP)...block(), asserting
SSRFException) assuming PendingHostnamesStore going global would make
it possible. It doesn't: the taint-context capture only gets written
by SpringWebfluxWrapper, which requires a real incoming WebFlux
request - a bare WebClient call from a test never populates it, so
the SSRF check silently no-ops and a real network call goes out
(confirmed empirically, request hung until timeout instead of being
blocked).
Hans separately found a customer report involving OkHttpClient and an
unknown-port DNS lookup - same shape as this PR's WebFlux scheduler-hop
bug, but via OkHttp's own Dispatcher thread pool instead (newCall()
registers intent on the calling thread, but call.enqueue() runs the
actual connect on a different thread from OkHttp's own pool).

Confirmed empirically rather than assumed: this PR's PendingHostnamesStore
fix is generic (not Spring/Reactor-specific), so it already covers this
case without needing anything OkHttp-specific. Verified the same test
fails on the base branch (ports lost across the thread hop, real
network call goes out unblocked) and passes here.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant